Harvard T.H. Chan School of Public Health
Consider a function (e.g., a conditional mean)
If we sample function values with errors…
… how can we learn the original form?
[Figure: step-function fits with 1, 2, 4, 6, 8, and 10 bins]
[Figure: fits across regularization strengths λ = 0, 1.6e-5, 3.1e-5, 5.0e-5, 8.7e-5, 0.00017]
[Figure: basis functions of tree-based models vs. HAL]
| Assumption | Example estimator | Rate |
|---|---|---|
| Parametric \(f_\theta(x)\) | Linear regression | \(\sqrt{n}\) |
| Additive \(\sum f_j(x_j)\) | GAM / 1-D spline | \(\sqrt[3]{n}\) |
| Cadlag (\(M < \infty\)) | HAL | \(\sqrt[4]{n}\) |
| Lipschitz | Honest forest, BART | \(n^{-2 / (d + 2)}\) |
| No assumptions | Random forest | Possibly biased |
“cadlag” := right continuous, left limits

[Figure: examples of cadlag functions: a Binomial CDF, and a Binomial CDF + Normal CDF mixture]
A cadlag \(F\) defines a measure, so Riemann–Stieltjes sums make sense:
\[\sum_{i=1}^n g(x_i) (F(x_{i}) - F(x_{i-1})) \approx \int g(x)dF(x)\]
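A quick numerical check (illustrative only: the Uniform(0, 1) CDF for \(F\) and \(g(x) = x^2\) are choices made here, so \(\int g\,dF = 1/3\)):

```python
# Riemann-Stieltjes sum against a cadlag F approximates the integral of g dF.
n = 10_000
xs = [i / n for i in range(n + 1)]

def F(x):            # Uniform(0, 1) CDF: cadlag
    return min(max(x, 0.0), 1.0)

def g(x):
    return x ** 2

rs_sum = sum(g(xs[i]) * (F(xs[i]) - F(xs[i - 1])) for i in range(1, n + 1))
print(rs_sum)  # ≈ 1/3
```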
NOT finite sectional variation:
Representation theorem (Benkeser and van der Laan 2016; van der Laan 2017): Every cadlag function with finite sectional variation can be written
\[f = \sum_{S \subset \{1,\ldots,d\}}\int I(u \leq x_S)f_S(du)\]
Approximating via empirical measure yields
\[f_n(x) = \sum_S\sum_{i=1}^n I(x_{S, i} \leq x_S)\beta_{S, i} = \Phi(x)\beta\]
\[\arg\min_{||\beta||_1 < M} \frac{1}{n}\sum_{i=1}^n L(y_i, \Phi(x_i)\beta)\]
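A minimal sketch of this fit, assuming squared-error loss and using scikit-learn's penalized Lasso as a stand-in for glmnet's norm-constrained problem (the data and tuning values here are illustrative, not from the slides):

```python
# Sketch of a HAL-style fit: enumerate the zero-order spline basis, then
# L1-regularize. sklearn's Lasso is a stand-in for glmnet (assumption).
from itertools import combinations

import numpy as np
from sklearn.linear_model import Lasso

def hal_basis(X, knots):
    """One column I(x_S >= knot_S) per nonempty section S and per knot
    (each observed row serves as a knot)."""
    _, d = knots.shape
    cols = []
    for size in range(1, d + 1):
        for S in map(list, combinations(range(d), size)):
            # 1 iff x dominates the knot on every coordinate in S
            cols.append(np.all(X[:, None, S] >= knots[None, :, S], axis=2))
    return np.concatenate(cols, axis=1).astype(float)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, size=100)

Phi = hal_basis(X, knots=X)            # 100 x (100 * (2^3 - 1)) = 100 x 700
fit = Lasso(alpha=0.001, max_iter=50_000).fit(Phi, y)
```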
Pros:

- Can use glmnet to optimize common loss functions

Con: Must enumerate the basis \(\Phi(x)\). How big is it?
\(n\) basis functions \(\times\) \(2^d - 1\) sections =
\(\Huge{O(n\cdot 2^d)}\)
space complexity
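For concreteness, a quick tally of how fast \(n \cdot (2^d - 1)\) grows (the sample sizes and dimensions here are illustrative):

```python
# Columns of Phi: n knots times (2^d - 1) nonempty sections.
def n_basis_columns(n, d):
    return n * (2 ** d - 1)

print(n_basis_columns(1000, 5))   # 31,000 columns: easy
print(n_basis_columns(1000, 20))  # over a billion columns: infeasible
```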
Idea: Randomly sample basis functions to approximate HAL
RandomHAL: Fit \(f_{m,n}\) as follows:
The sampled basis satisfies \(\Phi_m(x) \subset \Phi(x)\), but does the resulting estimator converge?
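A sketch of the basis-sampling step. Only step 1.a (drawing the section size \(D \sim \text{Binomial}(d, p)\)) is stated in these slides; choosing the section uniformly at random and taking knots from the observed rows are assumptions made here for illustration:

```python
import numpy as np

def sample_random_basis(X, m, p=0.5, rng=None):
    """Sample m zero-order basis functions rather than enumerating all
    n * (2^d - 1). Section and knot choices are assumed uniform."""
    rng = rng if rng is not None else np.random.default_rng()
    n, d = X.shape
    cols = []
    for _ in range(m):
        D = 0
        while D == 0:                    # redraw degenerate empty sections
            D = rng.binomial(d, p)       # step 1.a: section size
        S = rng.choice(d, size=D, replace=False)  # assumed: uniform section
        knot = X[rng.integers(n)]                 # assumed: knot from data
        cols.append(np.all(X[:, S] >= knot[S], axis=1))
    return np.column_stack(cols).astype(float)    # n x m sampled Phi_m

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 6))
Phi_m = sample_random_basis(X, m=500, p=0.5, rng=rng)
```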
\[\mathcal{F}_1 \subset \mathcal{F}_2 \subset \ldots \subset \mathcal{F}_m \subset \mathcal{F}\]
Clearly RandomHAL satisfies this
Theorem 1: Given nested risk minimization, \[\lVert f - f_{m,n} \rVert \leq \lVert f - f_n \rVert + \lVert f_n - f_{m,n} \rVert = o_P(\max(r_n, r_m))\]
where \(r_n\) is the rate of the full HAL fit \(f_n\) and \(r_m\) is the rate at which the sampled fit \(f_{m,n}\) approaches \(f_n\).

If \(r_m\) is faster than \(r_n\), then \(f_{m,n}\) preserves the rate of \(f_n\)
(Nesterov 2012): if the loss is strongly convex (grows at least quadratically rather than linearly), then random coordinate descent satisfies
\[||f_{m,n} - f_n||^2 < C_1\Big(1 - \frac{C_2}{k}\Big)^m\]
where \(k\) is the number of coordinates (basis functions),
so for HAL we actually only need
\[m = o(n\cdot \log(n))\]
Sample simulation: 200 iterations with \(d = 6\)
\[\begin{align*} X_1, \ldots, X_4 &\sim \text{Beta}(\alpha, \beta) \\ X_5 &\sim \text{Ber}(0.5) \\ A &\sim \text{Ber}(\text{expit}(X_1X_5 + \sum_{k=1}^4 (X_k + X_k^2))) \\ Y &\sim \text{Normal}(A + X_5A + 2X_3X_4\\ & + \sum_{k=1}^4 (X_k - X_k^{3/2}), 0.3) \\ \end{align*}\]
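One way to code this data-generating process. The Beta parameters \(\alpha, \beta\) are not specified on the slide; Beta(2, 2) is assumed here to match the later 50-dimensional design:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n, rng):
    X = rng.beta(2, 2, size=(n, 4))       # X_1..X_4 ~ Beta(2, 2) (assumed)
    X5 = rng.binomial(1, 0.5, size=n)     # X_5 ~ Ber(0.5)
    pA = expit(X[:, 0] * X5 + np.sum(X + X ** 2, axis=1))
    A = rng.binomial(1, pA)               # treatment
    mu = A + X5 * A + 2 * X[:, 2] * X[:, 3] + np.sum(X - X ** 1.5, axis=1)
    return X, X5, A, rng.normal(mu, 0.3)

X, X5, A, Y = simulate(1000, np.random.default_rng(0))
```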
Does RandomHAL still work when there are too many variables to fit HAL?
Error: from glmnet Fortran code (error code 7777); All used predictors have zero variance
Recall step 1.a: Draw section size \(D \sim \text{Binomial}(d, p)\).
\(p = 0.5\): with \(d = 50\), sections average 25 variables, so the joint indicators are almost never 1 and the sampled columns have zero variance
Recall step 1.a: Draw \(D \sim \text{Binomial}(d, p)\).
\(p = 0.05\): sections now average only 2.5 variables
20 replicates with \(p = 0.05\) of
\[\begin{align*} & X_1, \ldots, X_{50} \sim \text{Beta}(2, 2) \\ & A \sim \text{Ber}(0.1 + 0.8I(\sum_{k=1}^{50} 0.2(X_k - X_k^{3/2}) > 1)) \\ & Y \sim \text{Normal}((A+10)\sum_{k=1}^{50} 0.2(X_k - X_k^{3/2}), 0.1) \\ \end{align*}\]
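The same design, sketched in code. With \(p = 0.05\) the sampled section sizes \(D \sim \text{Binomial}(50, 0.05)\) average only 2.5 variables, plausibly why the sampled indicator columns no longer collapse to zero variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.beta(2, 2, size=(n, 50))             # X_1..X_50 ~ Beta(2, 2)
s = np.sum(0.2 * (X - X ** 1.5), axis=1)     # shared signal
A = rng.binomial(1, 0.1 + 0.8 * (s > 1))     # treatment
Y = rng.normal((A + 10) * s, 0.1)

# With p = 0.05, sampled sections stay small (mean size 2.5)
D = rng.binomial(50, 0.05, size=10_000)
print(round(D.mean(), 1))
```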